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^ ■ Abstract 

p, 

yj-^ , We develop a method to carry out MAP estimation for a class of Bayesian 

f— ^ ' regression models in which coefficients are assigned with Gaussian-based spike 

and slab priors weighted by Bernoulli variables. Unlike simulation-based in- 
ference methods, the proposed method directly optimizes the logarithm of the 
joint posterior density for parameter estimation. The corresponding optimiza- 
tion problem has an objective function in Lagrangian form in that regression 
coefficients are regularized by a mixture of squared I2 and /q norms. A tight 
approximation to the /q norm using majorization-minimization techniques is 
derived, and a coordinate descent algorithm in conjunction with a specified 
soft-thresholding scheme is used in searching for the optimizer of the approxi- 
mated objective. Simulation studies show that the proposed method can lead 
to more accurate variable selection than other benchmark methods. It also 
shows that the Irrepresentable Condition (Zhao and Yu, 2006) appears to have 
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less impacts on the performance of the proposed method. Theoretical results 
further show that under some regular conditions, sign consistency can always 
be established, even when the Irrepresentable Condition is violated. Results 
on posterior model consistency and estimation consistency, and an extension 
to parameter estimation in the Generalized Linear Models are provided. 

Keywords: MAP estimation; Iq norm; Majorization-minimization algorithms; 
Irrepresentable Condition. 

1 Introduction 

Consider the following regression model 

Yi = Xii/3i + Xi2/32 H h Xip/3p + ei, (1.1) 

where Yi is the response variable for the ith subject, Xij is the j'th covariate for the 
ith subject, Pj is the corresponding regression coefficient, and e^ is the error term 
following some specified distribution. Variable selection in regression problems has 
long been considered as one of the most important issues in modern statistics. It 
involves choosing an appropriate subset S of indices {1, 2, ■ ■ ■ , p} so that for j G S, 
the covariates Xjj's and estimated coefficients /3j's are scientifically meaningful in 
interpretation, and estimates y^/ = '^j(zs^i'jf^j have relative good properties in pre- 
diction. A general approach to tackling variable selection problems is via utilizing a 
criterion in which the objective function is subject to a constraint on the number of 
covariates IS*]. The well-known information-based criteria such as AIC and BIG are 
of this kind. For small p, these criteria can be calculated by fitting models with all 
possible 2^ combinations of the covariates. However, when the number of covariates 
increases, the number of candidate models increases exponentially. For large p, it is 
virtually impossible to calculate these criteria for all 2^ candidate models. 

Several approaches have been developed by modern Bayesian statistics to tackle this 
problem. One approach is to use stochastic search methods. The methods aim to 



relax the combinatorial difficulty by stochastically exploring the high posterior prob- 
ability regions in the model space. A well-known example is the Stochastic Search 
Variable Selection (SSVS) proposed by George and McCuUoch [16|. The SSVS as- 
signs a mixture prior weighted by Bernoulli variables to each regression coefficient, 
and then applying a Gibbs sampling procedure to sample the posterior probability on 
the Bernoulli variable. The resulting Monte Carlo average over the Bernoulli samples 
can be seen as an estimate for the posterior inclusion probability of the covariate, 
and variable selection can be done by either evaluating the posterior probability with 
a given threshold or evaluating point estimates of regression coefficients via hypoth- 
esis testirig or creditable interval construction. A general review of SSVS can be 



found in 17|]. Recent researches in Bayesian variable selection using MCMC-based 



inference procedures focus on providing full Bayesian solutions to variable selection 



problems formulated by frequentists. These include the Bayesian Lasso 



35|,|20| and 



the Bayesian Elastic Net |26i]. Other relevant approaches include 19[. A major ad- 
vantage of MCMC-based inference procedures is that they provide a practical way 
to assessing posterior probabilities, and inference tasks such as variable selection or 
point estimation can be straightforwardly carried out based on posterior probability 
calculation. However, as the number of covariates p becomes quite large, MCMC- 
based inference procedures may become time-consuming. In addition, convergence 
of MCMC-based sampling algorithms is not often guaranteed. Other approaches 
developed in modern Bayesian statistics to tackle variable selection problems in- 
clude those based on shrinkage-thresholding procedures. Empirical Bayes methods 
[l5| , |5| , |2J] aim to estimate regression coefficients by ffist applying numerical pro- 
cedures to obtain a crude, non-zero estimate for each coefficient, and then using 
post-processing procedures to discard those with small values or those with large 
variances. On the other hand, the maximum a posteriori (MAP) approach aims 
to estimate regression coefficients by directly maximizing the joint posterior density 
function. The Bayesian logistic regression model with Laplace priors developed by 
Genkin et al. [141] is of this kind. When the number of covariates p is large, the MAP 



approach relies on the use of efficient search algorithms for parameter estimation. 



In a Bayesian variable selection setting, these algorithms often involve iteratively 
applying shrinkage-thresholding steps to obtain estimates that have sparse features, 
i.e. some of them have exact zero values. In this sense, parameter estimation and 
variable selection can be achieved simultaneously under the MAP approach. In addi- 
tion, with suitable algorithms, variable selection under the MAP approach can also 
be fast and efficient. It is particularly explicit in the situation in which the number 
of covariates p is large but the number of covariates with non-zero coefficients is small. 



Recent frequentists' approaches to variable selection focus on applying the idea of 

regularization estimation in the situation in which the number of variables is much 

arger than the number of samples. These include the SCAD 9|], 48|, the Elastic net 



47| , J49|] , [ll| , the Adaptive Lasso [46|, the Grou p L asso |42|],[30|, the Dantzig Selector 
2|], the Relaxed Lasso 3l|, and MC_|_ penalty 4^. All these approaches can either 
been seen as alternatives or as extensions of the Lasso estimation [371] • In addition, 
these approaches appear to have corresponding Bayesian interpretations. For exam- 
ple, the Adaptive Lasso can be interpreted as the one which assigns Laplace priors on 
regression coefficients, with the scale parameter estimated by using prior knowledge 
of the OLS estimates or the ordinary Lasso estimates. Theoretical results provided 



by Knight and Fu 25|] show that with regular conditions on the order magnitude of 



the tuning parameter, the Lasso is consistent in parameter estimation. However, as 



shown by Meinshausen and Biihlmann 32|, and Zou (46|, for the Lasso estimation, 
consistency in parameter estimation does not imply consistency in variable selection. 
Further conditions on the design matrix and tuning parameter should be imposed 
to ensure consistency in variable selection for Lasso type estimations. Zhao and 
Yu 451] established the Irrepresentable Condition and showed that the Lasso can be 



asymptotically consistent in both variable selection and parameter estimation if the 
Irrepresentable Condition holds and some regular conditions on the tuning parame- 
ter are satisfied. The same condition was also established by Zou 
Lin [431. 



46| and Yuan and 



In this paper we develop a method to carrying out MAP estimation for a class 
of Bayesian models in tackling variable selection problems. The use of MAP es- 
timation in variable selection problems had previously been studied by Genkin et 



al. IJ] in a logistic regression setting. The key difference between our approach 



33j, i.e. 



and Genkin et al.'s is that our model assigns a Mitchell-Beauchamp prior 
the Gaussian-based spike and slab prior weighted by Bernoulli variables, on each 
regression coefficient. Conventionally, parameter estimation for this model relies on 
MCMC or other simulation-based methods. Recent studies by Ishwaran and Rao 



23j, 22|] used a rather different approach in that regression coefficients are estimated 
via OLS-based shrinkage methods. In our MAP estimation, an augmented version 
of the posterior joint density at logarithm scale is derived. From frequentists' point 
of view, the MAP estimation is equivalent to the regularization estimation with a 
mixture penalty of squared I2 and Iq norms on regression coefficients. In practice, 
we apply a majorization-minimization technique to modify the penalty function so 
that convexity for the objective function can be achieved. We further construct a 
coordinate-descent algorithm based on a specified iteration scheme to obtain the 
MAP estimates. Simulation studies show that using the MAP estimates can lead 
to better performances in variable selection than those based on other benchmark 
methods in various circumstances. Moreover, theoretical results show that the MAP 
estimator is asymptotically consistent in variable selection even when frequentists' 
Irrepresentable Condition is violated. 

The paper is organized as follows. Section 3 focuses methodological aspects of the 
proposed method. Section 4 provides two simulation studies on performances of the 
proposed method. Section 5 develops relevant asymptotic analysis for the method. 
Section 6 extends the method to parameter estimation in the Generalized Linear 
Models. Real data examples are provided in Section 7. Discussions and concluding 
remarks are given in Section 8. 



2 Notations 

Let X be an n X p design matrix. The ijth entry of X, {X)ij, is denoted by Xij, 
and the zth row of X is denoted by Xj. The transpose of X is denoted by X'^. Let 
y = {yi,y2, ■ ■ ■ ,yn), and (3 = (/3i,/32, ■ ■ ■ ,/3p)- Here y is a reahzation of random 
variable Y = {Yi,Y2,--- ,Yn). Denote Ipxp the p x p identity matrix. For a p- 
dimensional vector a = (ai, a2,- ■ ■ , ctp), define the li norm by ||a||i = X]?=i I'^jl' ^^^ 
I2 norm by ||a||2 = (^^=1 kiP)"*^^^! the /qo norm by ||a||oo = maxj |aj|, and the /q 
norm by ||aj||o = X]f=i ^('^i 7^ 0)' where I(aj 7^ 0) is an index variable such that 
l{aj 7^ 0) = 1 if aj 7^ 0, and l{aj 7^ 0) = otherwise. The probability density of the 
variable /3 conditional on 9 is denoted by /(/3| 6). For non-zero valued coefficients 
in /3, denote 5* the corresponding index set. Finally, we define the sign function for 
variable /3 by 



/ 



sign(/3) 



1 if /3 > 

-1 if/3<0 (2.1: 

if /3 = 



3 The method 



We start by formulating the regression model ( II. ip under a Bayesian framework. 

Note that in a regression model a covariate can only be selected if its coefficient is 

estimated with a non-zero value. Based on this result, we assign an index variable 

7j = I(/3j 7^ 0) to each covariate so that if [3j 7^ then, 7^- = 1 and Xj'yj(3j = Xj(3j, 

and if f3j = 0, then 7^- = and Xj'yj(3j = 0. With 7^, the regression model fll.ip has 

an equivalent representation given by 

p 
Yi = ^Xij-fjl3j + ei. 
i=i 

From a variable selection point of view, the index vector 7 = (71,72, ■■ ■ ,7^) is an 
indicator for candidate models. Different candidate models will have different values 
in 7. 



3.1 The Bayesian formulation 

Under a Bayesian framework, we assume 



it 



Yi\ /3, 7, cr^ ~ Normal I \^ ^ijljl^ji '^^ ] for i = 1, 2, ■ ■ ■ , n, 

/3,|c^^7.■,A ~ 7,Normal(0,cr2A~i) + (l-7,)I(/3, = 0) for j = 1, 2, ■ ■ ■ ,p, 
^^ I ''"1)^2 ~ Inver se- Gamma (ti, r2), 

7j| K ~ Bernoulli(fi:) for j = 1, 2, ■ ■ • ,p. (3.1) 

The prior on /3j implies that conditional on 7^ = 0, /3j is equal to with probabil- 
ity one, and conditional on 7j = 1, /3j follows a normal distribution with mean 
and variance a^A"^. The prior on /3 is the same as the spike-slab prior proposed by 
Mitchell and Beauchamp [33|. In addition, given fixed a^, the hyperparameter A will 
play a crucial role in controlling the concentration of Pj. The prior will gradually 
concentrate its mass on /3j = as A — )■ 00, and will gradually disperse its mass if 
A — > 0. It implies that information the prior can provide is dependent on A, and as 
A decreases, the prior will become less informative. Moreover, since 7^ G {0, 1}, the 
mixture form of the prior allows us to express it as Normal(0, cr^A"^)'^^ x^iPj = 0)^^'''^. 
This representation will be used in deriving the joint posterior density of the param- 
eters. The variance a^ has prior mean E(cr^| ti,T2) = T2/(ti — 1) and variance 
Var{a^\ Ti,T2) = t|/[(ti — 1)^(ti — 2)] given that Ti > 2. The hyperparameter k is 
the prior probability of including a covariate in the regression model. The Bernoulli 
prior on 7^ says that if only prior information is available, 7j will have probability k, 
to be 1 and 1 — k to be 0. 

Under Bayesian model (13. ip . the joint posterior density of /3, 7 and a^ can be ex- 
pressed as 

/(/3,7,a2| X,y,X,n,T2,K) oc f{y\ X,a', /3,^)fi(3\ a',^,X)fia'\ ri,r2)/(7| «:)• 

(3.2) 

With (13. 2p . we can estimate (/3,7,o"^) via various inference methods. In this paper 
the MAP (maximum a posteriori) method is adopted. Formally the MAP estimator 



for (/3, 7, 0"^) is defined by 

(^7,5^^) = arg mill {-2 log/ (/3, 7,0-^1 X, y, A, n, ts, k)}, (3.3) 

i.e. the minimizer of the minus 2 logarithm of the joint posterior density. The minus 
2 logarithm of the joint posterior density can be explicitly expressed by 

-21og/(/3,7,a2| X,|/,A,ri,r2,fi;) = -^^ ivi -^Xij-fjf3jj 

+ ^ + (n + 2ri + 2)loga2 

+ 2^ 7, log I ^^^^^ 1 + const. 

(3.4) 



j=i 



Here we have used an equivalent representation Normal(0, A '^)"'^ x I(/3j = 0)^ "^j for 
/(/3j| A, 7) given that 7^- G {0, 1}. Note that in (I33D the term T.%^\ogl{Pj = 0)^"^^ 
vanishes since for every j, jj = 1 implies I(/3j = 0) = 0. In turn, (1 — 7^) logI(/3j = 
0) = ■ 00 = 0. On the other hand, 7^ = implies I(/3j = 0) = 1, and in turn 
logI(/3j = 0) = logl = 0. 

Given fixed a^, the function (13. 4p has some meaningful interpretations in terms of 
regularization estimation on /3. For practical purposes, we fix a^ and multiply (13. 4 p 
with 0"^ in the following discussion. Note that since 7j/3| = /3|, the second term in 
( 13. 4p can be seen as a squared I2 norm on (3 weighted by A. We write it by A||/3||2. 
The squared I2 norm is smooth over MF, the domain of /3. Since it is not singular 
at the origin, it can not shrink /3 to zero in parameter estimation. In addition, as 
13 j increases, the penalty value under A||/3||2 will increase quadratically. It means 
that regression coefficients with larger absolute values will be penalized heavier than 
those with smaller absolute values. The undesired property leads to biased estima- 
tion in regression coefficients. The squared I2 norm, however, has a good property 
in least squares-based estimation when the number of variables p is larger than the 
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number of samples n. It provides an extra source for basis expansion in least squares 
estimation. It is important in the p > n situation in which the Gram matrix X^X 
is not of full rank, and the ordinary least squares technique is impossible to perform. 
With the squared I2 norm on (3, a set of pseudo basis can be added to the row of 
the design matrix, and the full rank condition can then be satisfied. A well-known 
application of this technique is ridge regression, for which the least squares estimate 
can be stably calculated even in the situation in which p is much larger than n. 

For the fourth term in (13. 4p . note that by definition 7j > 0, and the term X]f=i7i 
can be seen as an li norm on the vector 7. We can write the fourth term in f l3.4p 
by PA,K,cr2|l7lli5 where Pa,k,(t2 = cr^log[27ra^A^^(l — k)^/?^^]. Note that as k increases, 
Pa,k,o-2 will decrease. It implies that a strong belief in the presence of a variable will 
decrease the penalty value for the variable. In addition, by definition 7j = I(/3j 7^ 0), 
and the term ||7||i can be seen as an Zq norm on /3. To see how it can be, note that 
II7II1 = Yyj=i \K(^j 7^ 0)1 = liHis-fO X]^=i /^l' which is the Iq norm by definition. Here 
we have used the assumption that 0° = 0. We can express the fourth term in (13. 4 p 
as Pa,«;,,72||/3||o- 



3.2 Parameter estimation 

Now given all other parameters fixed, the MAP estimator of a^ can be derived by 
first making a derivative of (13. 4p with respect to o"^, then setting the derivative to 
zero, and solving the equation for cr^. The estimation of f3 is then carried out given 
0"^ is fixed. With fixed a'^ and the regularization interpretations given above, (13. 4 p 
has an equivalent representation given by 

HP; A, pa,.,.2) = I |y - x/3| I^ + A| |/3| I^ + p^^^^^2 1 1/3| |o + const, (3.5) 

Note that here we have multiplied (13. 4p with cr^. Now with (13. 5p . we can construct 
an iteration scheme to obtain (13. 3p . At the (m + l)th iteration, the iteration scheme 



is defined by 

-2N(m+l) 



MoM\2 , \ Y^P ;^(™)r«(™)^2 



\(J 






^(m+i) ^ argminL(/3;A,p;,,,^(52){m+i)), 

^(m+i) = (^/jj^+i) ^ 0), I(g'"+') ^ 0), ■ ■ ■ , I(;5j™+i) ^ 0)). (3.6) 

Note that the objective function (13.51) involves an Iq norm, which by definition, is 
not convex. Therefore related optimization tasks in the second term of (13.61) require 
some refinements. Here we adopt a relaxation approach to tackling the optimization 
problem. We begins the approach by noting that, mathematically the /q norm on a 
p-dimensional vector /3 can be expressed as 

||/?||„=limfMl±AW, (3.7) 

,3^0^ log(l+^3 ) 

which can be verified by seeing (13. 7p as a function of rs and using I'Hopital's rule. A 
more detailed discussion on the properties of the log-sum function in the right hand 
side of (13. 7p will be given later. Now we only focus on using it in solving optimization 
problems involving /q constraints. With representation (13. 7p . the objective function 
(13. 5p can be re-expressed as 

i^(/3;A,PA,.,.0 = lk-^/3|l2 + A||/3||^ + Pw4li^nE^^r7rT^ 

l-a^opt log(l + r3 ) J 

(3. 

If r3 is small enough, the log-sum function in the right hand side of (13. 8p will give an 
approximate representation of 1 1/3| |o. Graphical representations for the log-sum func- 
tion with different t^ and their mixtures with the squared I2 norm can be found in the 
top panel of Figure [TJ In addition, since the log-sum function in (13. 8p is continuous 
on /3, the combinatorial nature of 1 1/3| |o is relaxed. However, the term log(l + r3""'^|/3j|) 
is neither convex nor concave on (5j G M, and replacing 1 1 /3 1 1 with (13. 7p in (13. 5 p will 
still make the whole objective function (13. 8p remain non-convex. To tackle this prob- 
lem, a majorization-minimization algorithm is adopted. Majorization-minimization 
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(MM) algorithms 2l|], 4l| are a set of analytic procedures aiming to tackle diffi- 
cult optimization problems by modifying their objective functions so that solution 
spaces of the modified ones are easier to explore. For an objective function g{6), the 
modification procedure relies on finding a function h(9; 6^^'>) satisfying the following 
properties: 

h{e;e^^'^) > g{e) forall^, 
/,(^(0;^(0) = ^(^(0). (3.9) 

In (13.91) . the objective function g(9) is said to be majorized by h{6]6^^^). In this 
sense, h{6]6^^'>) is called the majorization function. In addition, (13.91) implies that 
h{6] 9^'-^) is tangent to g{9) at 9^^\ Moreover if ^('+^) is a minimizer of h{9; 9^^^), then 
(13.91) further imphes that 

^(^(0) = /i(^(0. ^(0) > /,(^('+i); ^W) > ^(^('+1)), (3.10) 

which means that the iteration procedure 6'^'^ pushes g{9) toward its minimum. 

Now we turn back to the function in the right hand side of (13.81) . Note that, since 
log(^) is a concave function of 9 for 9 > 0, therefore the inequality 

log(^) < log(0') + ^ - 1 (3.11) 

holds for all > and 9' > 0. Note that the right hand side of (13.111) is convex in 9. 
In addition, if we let 9' = 9, then (13. lip becomes an equality, which implies that the 
right hand side of (13. lip satisfies the properties stated in (13. 9p . therefore is a valid 
function for majorizing log(^). 

Proposition 3.1. Define p^^ = l/log(l + r.^-^) and let L'{(3; X, Px^^^a^) be the same 
as Ii3.5\) but without the constant term. Then L'{I3; X,Px^K,a'^) can be majorized by the 
following function: 

L"{(3; A,pa,k,.2,/3') = ||y - X/3\\l + A||/3||2 + px,.,.2h{/3; /3'), (3.12) 

where 

h{/3;/3') = hm 0.3^^1 log (l + r3-i|/3;|)+^;]^-l). (3.13) 

11 



Proof of Proposition [Xil Assume Z?*^'"*"-^) minimizes L"{I3] A, p\^K,a'^^ (^') given (3' = (3^^\ 
Then with (13. 7p and the inequahty (13. Ill) , the quantity ||/5^'^^^||o can be bounded in 
a way such that 






J=l \ IPJ 

p 



< hmp.3 5^1og(l + r3-i|/3f|) 



= P^'^llo, (3.14) 

which verifies the first condition stated in (13.91) . For /?' = /3, h{(3; 13') is equal to 
the log-sum function in (13. 7p . which verifies the second condition stated in (13.91) and 
completes the proof. D 

A graphical representation of using MM algorithms in approximating the log-sum 
function in (13. 7p can be found in the bottom-left panel of Figure [H From the argu- 
ment given above, we can construct an iteration scheme to minimize L{P] \iP\^K,a^)i 
with the /o norm, or equivalently the log-sum function, replaced by h{[3; f3') defined 
in Proposition 13. 1[ For example, in (13. 6p . /3("*+^) can be obtained by carrying out 
the following iteration scheme: 



over index /, where (f)j = limT-3_>o[log(l + T^^)i\l3j \ + ts)]^^. The proce- 

dure of using iteration scheme ( I3.15P in solving the problem of minimizing (13.51) . 
or equivalent (13. 8p . is called the BAVA-MIO (BAyesian VAriable selection using a 
Majorization-mlnimization apprOach), and the corresponding minimizer is called the 
BAVA-MIO estimator. 

Note that the last term in the right hand side of (I3.15P is a linear combination 
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of (pP |/3j|, a convex function of /3j, therefore given \\y — X(5\\2 + A||/3||| is con- 
vex, the whole objective function in f l3.15p will be convex, which guarantees that the 
iteration scheme will converge. In addition, the minimizer fl3.15p can be solved by 



using the coordinate descent algorithm proposed by Friedman et al. 12|. In practice, 
the coordinate descent algorithm is based on iteratively cycling a one-dimensional 
soft-thresholding scheme. At the {mi + l)th iteration, the soft-thresholding scheme 
for the jth coordinate is given by 

. n X -1 . n ^(™i)\ 

^(..+1) ^ l^^^j ^ ^j 5T(^X^x.,ri-.),p.,,..^j (3.16) 

where f-™^- = yi - Ylj'^j ^ij'P'p^ with m[ = mi + 1 for j' = 1, 2, ■ ■ • , j - 1, and 
m'l = mi for j' = j + 1, j + 2, ■ ■ ■ ,p. Here ST{a, b) is a soft-thresholding operator 
defined by ST{a,b) = sign(a)(|a| — 6)+. A detailed derivation of fl3.16p is given in 
Appendix A. 

3.3 Choosing hyperparameters 

Choosing appropriate hyperparameters for prior construction is an important issue 
in many Bayesian inference problems. For hyperparameters present in the model 
(13. ip . we consider the triple (A,ri,r2) first. One principle we adopt in parameteriz- 
ing the hyperparameters is that as the number of samples n increases, the impact of 
the hyperparameters in posterior inference will become less significant. For example, 
in the marginalized likelihood, we wish to see Xt2 — )■ as n — )■ oo. In this sense, 
we may let Xt2 oc {p\ogp)/n. In addition, we let Ti = T2 + 1, so that the prior 
expectation of o"^ is equal to 1. Given these conditions, one of the possible choices 
is (A,ri,r2) = {l/y/n,p\ogp/y/n + l,p\ogp/y/n). We will will this setting in the 
simulation study in the later section. 

Now we consider the prior inclusion probability k. In some circumstances data- 



driven empirical Bayes approaches are proposed 15|], while in other circumstances 



full Bayesian methods that assign priors on k are proposed [27[. Unlike previously 
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proposed approaches, in which single point estimates were obtained for n, we adopt 
an approach by specifying a feasible region for the function 

^Ij{k) = ha"" log(27ra2A-^(l - fi:)^//?^)]^^. (3.17) 

Here 0*^"-' = <\)a for j = 1,2, ■■ ■ ,j9 since the term (\)^^^ uses initial values /3 = 
for j = 1, 2, ■ ■ ■ ,p. The function ^(k) is the threshold used in the soft-thresholding 
operator fl3.16p . We carry out the inference procedure with values in the feasible 
region and look for which values of ^/^(k) lead to the best performance measured by 
criteria such as the one used in ten fold cross validation or the Bayes factor. Under 
this approach estimated parameters can be seen as functions of k on the feasible 
region. Given different values of k, curve-like paths for estimated parameters can 
be obtained. The main reason we adopt this "whole-path" fitting strategy is that 
the optimization procedure may get stuck at some stationary points. To explain 
this, note that we need an initial value 0^- to run the iteration scheme fIS.lSp . By 
definition, (\)- is a function of /3- , which by definition, is a function of '?/'(«;). As 



pointed out by Candes et al. [3] and Mazumder et al. 28|, different 0^- may lead 



to different solutions for the minimizer. Under this situation a global minimum may 
not be guaranteed. By using the strategy given above, we can run the iteration 
scheme (I3.15P with a large number of possible values of 0^- , therefore eliminating 
the possibility that the solution is stuck at some local minima. 

In addition, the reason we do not directly specify a feasible region for k but rather 
the function '0(fi;) is because the threshold '0(k) is non-linear in k. If the function 
had been derived from a specified region for k, there would have been wide and non- 
equal gaps between discretized values of "^{k). As implied in the soft-thresholding 
operator representation f l3.16p . a large positive or negative increment in '^{^k) might 
lead to a sudden decrease or increase in the number of selected covariates. One way 
to stabilize the estimation procedure is to steadily decrease (or increase) the value 
of '0(k). 
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Moreover, although we only specify the feasible region for ip{K,), we can still ob- 
tain K, by using the following calibration strategy. To begin the strategy, we fix o"^ 
and A first. Note that the minimum requirement to include a covariate in the model 
is that maxj | '^^=iXijyi\ should be greater than ip{n). Assume maxj | X]r=i'''u^«l ^^ 
upper bounded by constant C*. A feasible region for ■?/'(«;) is defined as [0,C*]. For 
an arbitrary c* G [0, C*], the corresponding n* for ip{K.*) = c* is given by 

K* = — , (3.18) 

1 + exp(c**/2) ^ ^ 

where c** = 2c*/((j^0*-°'') — log27rcr^A~^. In practice, we discretize the feasible region 
[0, C*] into equal spaces and then use the discretized values to calculate k* according 
to f l3.18p . Our approach is the same as the one using a fixed grid on the tuning 
parameter for parameter estimation. This fixed grid approach to tuning parameter 
selection has been adopted in [lJ|, [l2|, and 30[, and is advocated by 411 ^^^l 1I| 
for fast and accurate parameter estimation. 

3.4 A toy example 

Here we provide a toy example to illustrate the BAVA-MIO estimation. We let 
the number of samples n = 100 and the number of covariates p = 1,000. For 
regression coefficients (3 = (/3i,/32, ■ ■ ■ ,/3iooo), we let (325o = 2, /^soo = -3.2, /3750 = 
-1.25, /3iooo = 5.44, and /3j = for j e {1,2,- •■ ,1000} \ {250,500,750,1000}. 
We simulate each row of X independently identically from MVN(0, Jpxp), and then 
calculate Y = Xf3 + e with e ~ MVN(0,/nxn)- For the hyperparameters, we let 
Ti = 0.2plog{p)/y/n + 1, T2 = 0.2p\og{p)/^/n and A = I/a/ti. Further let T3 = 10~^. 
We use 100 equal spaced points to form a grid for \E'(/t;). We perform two BAVA-MIO 
estimations: one uses the Bayes factor and the other uses ten fold cross validation 
for tuning parameter selection. We define the Bayes factor between models Ais' and 
-Ms by 

BF{Ms', Ms; y) = -jr-i TT' (3-19) 

f{y i,n,T2,K,X) 
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where the term f{y\ 7, ri, r2, k, A) refers to the marginahzed hkehhood with /3 and 
0"^ being integrated out with respect to their prior probabihty measures. For the 
Bayesian model stated in (13. ip . the marginahzed hkehhood has a closed form repre- 
sentation given by 

/(y|7,n,r„.,A) = \^-,xjXs + ir'rin) ^[ 2 

[(n+|5|-l+2ri)/2] 



(y'^{X-'XsX]: + Iny\ + 2r2) 



In subsequent sections we will use the measure f l3.19p for variable selection. In addi- 
tion, for all variable selection tasks using fl3.19p . the baseline model Ais will always 
refer to the null model. 

The results are shown in Figure [2l The path plot in the top left panel of Figure 
[2] shows that non-zero coefficients entered into the model earlier under the BAVA- 
MIO estimation. In addition, the paths of estimated coefficients behave similar to 
those under the hard-thresholding estimation, that is, once a coefficient is estimated 
to be non-zero, the corresponding estimation path makes a sharp jump to the non- 
thresholded value. Moreover, due to the presence of the squared I2 norm in the 
objective function, the number of selected covariates can be larger than the number 
of samples. Throughout the estimation procedure, the maximum number of selected 
covariates is 831, which is much larger than the number of samples n = 100. Here 
we also provide the Lasso estimation for regression fitting with the same data. The 
results are shown in the right panel of Figure |2l As compared with the Lasso estima- 
tion, in which 33 covariates are selected using ten fold cross validation, the BAVA- 
MIO estimations using the Bayes factor and ten fold cross validation correctly select 
covariates with non-zero coefficients. In addition, as shown in the bottom panel of 
Figure [21 values of the non-zero coefficients are also estimated more accurately under 
the BAVA-MIO estimations. 
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3.5 Some remarks on the log-sum function 

We briefly discuss properties of tlie log-sum function stated in i\3.7\i . First, note 
that log(l + Tg^^'^l/^jl) = log(r3 + \(3j\) — log(r3). By multiplying —1 to the sum 
X]f=i log(''"3 + \t^j\) ^iid let Ts — )■ 0, one obtains the logarithm of the product of l/|/3j| 



over j = 1, 2, ■ ■ ■ ,p. As pointed out by Tipping [38|, the term l/\/3j\ is an improper 

version of Student's t density. A rather different way is to see the log-sum function 

^j=ilog(l + ''"s'^l/^jD as a product of logarithm of the generalized Pareto density, 

which has a parametric form given by 

1 / as^z-ai)^ ""'"^ 
Pgp{z) = — 1 



(22 V '^2 

for z G (ai,oo), ai G (—00,00), 02 G (0, 00), and a^ G (—00,00). By multiply- 
ing —2 and adding a constant term —plogr^ to X]j=ilog(l + t^'^K^jI), it becomes 
log Y[^=i ''"3~^(1 + l/^jl/'^s)^^) which is a logarithm of the product of generalized Pareto 
densities with location parameter ai = 0, scale parameter 02 = T3, and shape pa- 
rameter as = 1. 

The following two propositions discuss relationships between the log-sum function 
and the Iq and li norms. The first one states that the error rate between the log-sum 
function and the /q norm measured by an /i distance is of order — logrs as T3 -^ 0. 
Proofs of the two propositions will be given in Appendix B. 

Proposition 3.2. Define gi{l3j;Ts) = log(l + r3"^|/3j|)/log(l + t^^) and 5'2(/3j) = 
I{Pj 7^ 0). Then for t-^ G [0, 1), there exists a positive constant Ci such that 

\\gi{P;r,)-g,{P)\l<cJ^-\. (3.20) 

A graphical representation of Proposition 13.21 can be found in the bottom-right panel 
of Figure [1] The next proposition states that the log-sum function can do better 
in approximating the Iq norm than the /i norm as t^ — )■ 0. On the other hand, 
results in this proposition also implies that the log-sum function approaches to li 



norm as ts — )■ 00. Sriperumbudur et al. 36|] gave another heuristic argument for this 
property. 
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Proposition 3.3. With the same notation used in Proposition 2, for Pj ^ and 
s G [0, 1], we have 



lim "^7;^^ = |/3,|-^ 



and 



giiPf 'Ts) _ i« ,1-. 



Therefore, limr^^o gi{f3j]T3) = I{f3j ^ 0) and lim^3^oofi'i(/3i; ^3) = |/3j| 



3.6 Connection with other approaches 

First we make some remarks on the use of the In norm in reeularization estimation. 

n Q n 

Fuchs [13[, Donoho et al. [7|, and Tropp [39'] independently showed that under some 
regular conditions, regression coefficients estimated with the /q constraint can be 
approximated by those estimated with the l\ constraint. The advantage of using 
the l\ norm instead of the /q norm as a constraint on regression coefficients is that 
minimization with an l\ norm constraint is a convex optimization problem while the 
minimization problem with an Zq norm constraint is combinatorial in nature. 

However, as shown by Fan and Li 9|, the l\ norm tends to provide larger penalty 
values to large coefficients and smaller penalty values to small coefficients. There- 
fore large coefficients tend to be biased estimated while zero-valued coefficients tend 
to be estimated with non-zero values. On the other hand, the /q norm provides 
equal penalty values to all coefficients, therefore is more likely to shrink small co- 
efficients to zero and keep large coefficients unchanged. Candes et al. |3J proposed 
a reweighted l\ approach for sparse recovery. They showed that the /q norm can 
be better approximated by the sum of some log functions than the conventional l\ 



norm. Sriperumbudur et al. [36| further explored the idea and used a modified 
log-sum function to approximate the /q norm in solving sparse generalized eigenvalue 
problems related to principal component analysis, canonical correlation analysis, and 
Fisher's discriminant analysis. 
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We have noticed that the use of binary indicators for variable selection has been 



studied by Yuan and Lin under the name of non- negative garrotte estimator 43 1. 
However, the estimation procedure associated to the non-negative garrotte estimator 
is quite different from the BAVA-MIO estimation proposed in this paper. In Yuan 
and Lin's proposal, the non-negative garrotte estimation is carried out via a two 
stage procedure. In the first stage, least squares estimation is proposed to obtain an 
initial estimate for each regression coefficient. In the second stage, a soft-thresholding 
estimation is proposed to obtain an estimate for the binary indicator associated to 
each regression coefficient. Under the soft-thresholding estimation, the estimate for 
the binary indicator is continuous on the interval between and L In this sense, the 
non-negative garrotte estimation can be seen as a shrinkage estimation on the least 
squares estimate. 

We have also noticed that the objective function stated in (13.151) is similar to the 
one used to obtain the Adaptive Elastic Net recently developed by Zou and Zhang 



49j. However, the weights used in the Adaptive Elastic Net objective are fixed while 
in (13.151) the weights are iteratively changed throughout the optimization procedure. 
In addition, Zou and Zhang did not see the /i norm with adaptive weights as an 
approximation to the /q norm. 

4 Simulation study 

In this section we conduct two simulation studies. The first one is a general test on the 
performance of the BAVA-MIO estimation. The second one focuses the performance 
of the BAVA-MIO estimation under various situations in which the Irrepresentable 
Condition may or may not hold. 
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4.1 Simulation study I 

In the first simulation study we compare the proposed method with other estima- 
tion approaches by fitting regression model Y = X/3 + e with data generated from 
different simulation schemes. Here Y and e are n-dimensional vectors, X is an n x p 
matrix of n x p matrix, and /3 is a p-dimensional vector. We assume each entry in 
e is i.i.d. from Normal(0, o"y), and each row in the design matrix X is i.i.d. from 
MVN(0, Ex). Throughout the whole simulation study, we let p = 120, and gener- 
ate Pj from a standard normal distribution for j = 1, 2, ■ ■ ■ ,10 and let (3j = for 
j = 11, 12, ■ ■ ■ , 120. That is, we have 10 non-zero and 110 zero coefficients in the 
"true" model. In addition, we use different values of (Sx,o"y,n) in generating the 
design matrix X and the error term e. We apply three different Sx to generate the 
design matrix. The ffist one has an independent structure with diagonal terms equal 
to 1 and off-diagonal terms equal to 0. The second one has a covariance structure 
such that (Sx)ij = 1 for i = j and {^x)ij = 0.5 for i ^ j. The third one has a 
Toepolitz structure such that (Sx)ij = 0.5'*"-''. Now define the signal-to-noise ratio 



by SNR = a/IE(/3^Sx/3)/o"y. We consider ay = 10, 1 and 0.2 in generating the error 
term e. These values correspond to SNR = 1,3.16 and 7.07, respectively. By using 
X, P and e, the response vector Y is calculated hy Y = X/3 + e. For the number of 
samples, we consider five values n = 40, 80, 120, 160 and 200. With three different 
structures for Sx, three different values for ay, and five different values for n, we 
have total 3 x 3 x 5 = 45 simulation experiments. 

We let hyperparameters (A,ri,r2) = {l/^/n,p\ogp/^/n + l,p\ogp/^/n) for the cases 
of SNR = 3.16 and 7.07. For the case of SNR = 1, we use 

/I — corr\ fj) 

A = \/-logp 

\ corr J \ n 

/ ' \ / I \ cofr/ log n 

I corr \ Iv iog V\ 
V 1 — corr / \ \fn 



1/1 \ l+cofr/logra 

,, . Uv}2^\ , ,4.i; 
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where coir is an average over the top 10 percent absolute values of the sample corre- 
lations between response Y and covariates X. Note that the prior guesses on ay = o"^ 
and cr^ = (T^/A under the Bayesian model (13. ip and parametrization (14.1 p are given 

by 

T2 f^~ CO?!' \ P log P 



ncr'\n,T2] = ^ . , 

Ti — 1 \ corr / n 

E[cr^| Ti,T2] f cofr \ Ap" 



A \ 1 ~ corr J \ n 

As cofr — )■ 1, the prior expectation of cry approaches to zero and the prior expecta- 
tion of al approaches to oo. For tuning parameter selection, we use two procedures: 
the Bayes factor, which is defined in f l3.19p . and ten fold cross validation. The cor- 
responding estimators are called BMIO-BF and BMIO-CV, respectively. 



We also carry out three other estimation approaches for comparisons. The first 
one is the Lasso |37|, which is defined by 

Aasso = argmm{||r-X^||2 + A||/3||i}. 

We use R package "glmnet" to obtain the Lasso estimates. The tuning parameter A 
is selected using ten fold cross validation. The second approach is the Relaxed Lasso 



311], which is defined by 



3reiaxo = argmin{||r-X/3||^ + 0A||/3||i}. 



We use R package "relaxo", which is the companion software to [3l|], to obtain the 
Relaxed Lasso estimates. The tuning parameter A is selected with ten fold cross 
validation with 100 values equally spaced in [0,1]. The third approach is the 



Adaptive Lasso [46|, which is defined by 

^adaiasso = arg miu !\\Y~Xf3\\l + \y2—]- 

We use R package "parcor" to obtain the Adaptive Lasso estimates with the default 
setting of |/9iassoj| as the initial value for Wj and tuning parameter A selected via ten 
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fold cross validation. 

We collect several performance measures at each simulation run. The first one is 
the standardized I2 distance between /3 and Ame, which is defined by 



l2-dis{/3) 



l^j=l[Pj Ptruej, 



2 



\ EJ.!/?, 



2 
true J 



The second one is the predictive mean squared error of (3 for a test data set, which 
is defined by 



PMSE(/3) 



Z_^i=l [-^i,tcstP ~ -^i,tcstPi 



'trUGJy 



'^test 

The test data set contains ntest = n x 10 data points generated using a simulation 
scheme the same as the training data set. The third one is the number of coefficients 
with non-zero estimated values |S'o|, where 5*0 = {j : (3j 7^ 0}. The final one is the 
sign function-based false positive rates, which is defined by 

g_ppj^ = #{j e So : sign(/3j) ^ sign(/3toue,i)} 

I Q I ' 

Pol 
where the sign function sign(.) is defined in (12.11) . 

For each of the 45 simulation experiments, we generate 100 runs to collect the 
four performance measures. We then plot the average of each peformance mea- 
sure against the ratio ra/l^ol, i.e. the ratio between the number of samples and the 
number of true coefficients with non-zero values. These plots are shown in Figures 
[3l m and [5] for SNR = 1,3.16 and 7.07, respectively. From the three figures we can 
see none of the estimation approaches can dominate the others in all aspects of the 
performance measures. In most cases, BAVA-MIO based estimations have smaller 
sign function-based false positive rates, as shown in the second column of each fig- 
ure. It implies that more accurate variable selection may be done using BAVA-MIO 
estimations. In general, BAVA-MIO estimations have fewer numbers of non-zero es- 
timates, as shown in the fourth column of each figure. These findings become more 
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significant as the number of samples increases. In addition, since the BAVA-MIO 
estimation using the Bayes factor has the fewest numbers of non-zero estimates than 
other estimation approaches, it is surprising that the PMSE and /2-dis measures 
under the BMIO-BF estimation are comparable to those under other estimation ap- 
proaches, for example, in cases with SNR = 3.16 and in some cases with SNR = 1. 
However, we also noticed that the BMIO-BF estimation has higher values in the 
PMSE and Z2-dis in the cases with SNR = 7.07, particularly in those with small 
numbers of samples. Figure [6] show heatmaps for the S-FPR and rankings of the 
S-FPR for the five estimation approaches under the 45 simulation scenarios. The 
heatmaps are generated by using the graphical software GAP (Generalized Associ- 



ated Plots), which was developed by Wu, Tien and Chen J40[ as companion software 
to ^ (http://gap.stat.sinica.edu.tw/Software/GAP/index.htm). The GAP- 
based heatmaps further suggest that using BAVA-MIO estimation can lead to more 
accurate variable selection. 

4.2 Simulation study II 

In the second simulation study we investigate the impact of the Irrepresentable Con- 
dition on the performance of BAVA-MIO estimation in variable selection. Before 
stating the Irrepresentable Condition, we give some notation definitions first. We as- 
sume the true model is parametrized by 70. Define Sq = {j : 70 j- = 1, for some j G 
{1, 2, ■ ■ ■ ,p}} and Sq = {1,2,- ■ ■ ,p} \ Sq. Denote f3so the coefficients with indices 
in 5*0, and Ps" the coefficients with indices in Sq. Similar definitions are also applied 
to Xso and X^c, respectively. An estimator /3(n) is said to be sign consistent in esti- 
mating f3 if the probability of the event {sign.(/3{n)) = sign(/3)} approaches to 1 as 
n -^ 00. Given the sign consistency holds, the estimated index set 5*0 = {j : 7o,j = 1} 
will be the same as the true index set 5*0, therefore the sign consistency implies vari- 
able selection consistency, that is, asymptotically with probability one, non-zero 
valued coefficients will have non-zero estimated values, and zero-valued coefficients 
will be estimated with zero values. 
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Zhao and Yu [45!] showed that if the Lasso estimation wants to achieve the sign 
consistency, then the design matrices Xs and Xs^ must satisfy the following condi- 
tion: 

\\XlXs,iXlXs,r'sign{Pso)\U < 1, (4.2) 

where Pso is the vector of non-zero valued coefficients. The condition fl4.2p is called 
the (Weak) Irrepresentable Condition. If the Irrepresentable Condition ( 14. 2 p fails to 
hold, then the sign consistency will never occur even when n — )■ oo. An intuitive way 
to explain the Irrepresentable Condition is to see the quantity XjcXso{Xg^Xso)~^ 
as a least squares estimate for the regression X^g on Xs^. In this sense, the Irrepre- 
sentable Condition states that the total amount of coefficients for the regression Xgc 
on Xso should not exceed 1, i.e. X^c is "irrepresentable" in terms of Xs^. 

Here we conduct a simulation study for the investigation. The simulation scheme is 
given as follows. We simulate 100 design matrices in which each row is i.i.d. from 
MVN(0, Ex), with Sx ~ Wishart(Jpxp,P, p) and p = 30. This setting is similar to 
the one used in Zhao and Yu's study. Corresponding regression coefficients /3 are 
generated in a way that the first 5 entries of /3 are i.i.d. from Normal(0, 1), and the 
rest of 25 entries are set to 0. Note that for some pairs (X, /3), the Irrepresentable 
Condition fl4.2p will hold, but for some it will not hold. Zhao and Yu defined the 
Irrepresentable Statistic by 

Irr.stat = 1 - \\X^.XsMsoXso)-'sign{/3s,)\\o. 

The Irrepresentable Condition is considered to be violated if Irr.stat is smaller than 
zero. We calculate the Irrepresentable Statistic and carry out 100 simulation runs 
for each pair (X, f3). In each run we generate n = 100 data points. We use ay = 0.05 
to generate the error term e, which is corresponding to SNR = 10. We then fit the 
regression model using the Lasso estimation and the BAVA-MIO estimation with 
these data points. For each pair (X, /3), we calculate the model selection probabil- 
ity P{So = So) based on counting the times of whether the estimated sign vector 
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matched the true sign vector throughout the whole regularization paths. 

We also carry out the same simulation experiment for SNR = 5, 2, and 1. The 
scatter plots in Figure [7] show the estimated model selection probability against the 
Irrepresentable statistic under different signal-to-noise ratios. From these plots we 
can see that performances of the BAVA-MIO and Lasso estimations are deteriorated 
when the signal-to-noise ratio is decreasing. However, we also found in some circum- 
stances the BAVA-MIO estimation can achieve high model selection probabilities 
even when the Irrepresentable Condition is violated, i.e. the Irrepresentable statis- 
tics are smaller than zero. In Section 5 we will provide theoretical results to explain 
this phenomenon asymptotically. The second and fourth panels in Table [T] show the 
squared correlations between the estimated model selection probability and the Ir- 
representable statistic. The squared correlations for the BAVA-MIO estimation are 
relatively small in compared with the Lasso estimation. 

5 Asymptotic analysis 

In this section we derive asymptotic results for the BAVA-MIO estimator. The first 
one give a theoretical explanation for the invariance of the BAVA-MIO estimator 
under the Irrepresentable Condition. The second one deals with the posterior model 
consistency related to the hierarchical Bayesian formulation stated in fl3.ip . and the 
third one states posterior estimation consistency of the BAVA-MIO estimator. 

Before give these results, we give some notation definitions first. Define y" = 
{yi,y2,--- ,yn)- The notation y"' emphasizes the fact that the number of entries 
in the observed response vector y is n. We use the same definitions given in Section 
4.2 for 5*0, 5*0, (3so, /^sg, -^So ^^^ Xs^- Further denote J^s the model characterized 
by the index set S. Under a Bayesian framework, Ms usually refers to the sampling 
density, and in this sense, Mso can be seen as the "true model", or the true sampling 
density that a sample comes from. Further denote S the parameter space that 5*0 
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belongs to. For a symmetric matrix C, denote Amin(C) and A^aa^iC) the smallest 
and the largest eigenvalues, respectively. 

5.1 Sign consistency 

Our result for sign consistency of the BAVA-MIO estimator is based on the following 
simplification: the variable a^ is fixed and the term Pa,k,o-2 in (13. 5p is treated as a 
constant. Denote the constant by p. Now define 

^-3 = argmin||y-X/3||^ + A||/3||^ + pX: ^T^\t5'-'^;''^ - ^^'^^ 

Note that the log-sum function in the right hand side of f IS.ip becomes 1 1/3| |o if T3 — > 0, 
and /3'^3 in this sense can be seen as the BAVA-MIO estimator. Further define 

Eo,r, = U : sign(/3,) = sign(^=') for j = 1, 2, ■ ■ ■ ,p|, (5.2) 

i.e. the event of sign consistency for the estimator /^^^ in (15. ip . For practical purposes, 
further define Css, = n-\XlXs, + A/), Cs^s, = n'^X^^Xs,, Ds, = n-^'Xle and 
D^c = n^^/'^Xgce. We assume the following conditions hold: 

Assumption 1. For Css = "^"^(ATjXg + XI) and any S ^ S, the maxi- 
mum eigenvalue Aniax(C'5s) and the minimum eigenvalue A^i^^Css) satisfy the 
following condition: 

< Ci < AmUCss) < Amax(C5s) < C2 < OO 

Assumption 2. For the vector X'^e, ||X-^e||i = 0(n^/^). 

Assumption 3. For parameter A, we assume < A < oo. For parameter p, 

we assume < p and pn^^^'^ — )• 0. 

Assumption 1 is a special case of the Restricted Eigenvalue Assumption stated in 
Bickel et al. l| . Assumption 1 implies that the inverse of Csso exists and the ratio 
[Amax(-^J-^5) + ^]/[Amm{Xs-^s) + A] < C2/C1 is bouudcd from above for any S E S. 
Assumption 2 is equivalent to the statement n~^/^||X^e||i < oo as n — )■ oo, which 
further implies that ||Z^5o||i < oo and ||D5g||i < oo. 
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Theorem 5.1. Given that Assumption 1 to 3 hold and t^ cc n ^, we have 

as n ^ oo. 

The proof is given in Appendix C The proof wiU start by exploring the KKT condi- 
tions associated to the minimization problem stated in fIS.ip . Note that in Theorem 
IS.ll we do not state that the Irrepresentable Condition should hold. Indeed, as stated 
in Corollary IC.ll in Appendix C, even if the Irrepresentable Condition is violated, 
the result stated in Theorem 15.11 will still hold given some mild condition is imposed. 

5.2 Posterior model consistency 

Under a Bayesian framework, posterior model consistency is defined as F{Aiso I Z/") ~^ 
1 as n — )■ oo. That is, the posterior probability will put all its mass on Aiso ^^ the 
sample size goes to infinity. Note that in general, multiple true models are allowed 
under Bayesian frameworks, i.e. 5*0 may not be unique. However, for simplicity we 
only pay attention on the situation in which there is only one true model. 

Note that the posterior probability P(A^5'| y"') with S' E S can be expressed in 
terms of Bayes factors by 

p.^ I ^n. _ /(^"l Ms')f{Ms') BF{Ms',Mso;ynfiMs') . . 

^ ''^ ' Esesfiy''\Ms)f{Ms) EsesBnMs,Mso;y-)f{Msy ^ - ' 

The formulation (15. 3p implies that the event P(A^5g| y^) — )■ 1 is equivalent to the 

events BF{Ais,.Mso) ~^ ^^ all S' G 5 and 5* 7^ Sq, given that the probability 

f{M.So) is bounded from zero. It turns out that to examine whether the posterior 

probability is consistent at the true model Aiso is the same as to examine whether 

Bayes factors between the true model and other models will approach to zero. 

Here we make some assumptions on the Bayesian formulation (13. ip before stating 
the main result of posterior model consistency. 
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Assumption 4: The number of coefficients p is bounded away from infinity, 
i.e. p < oo. In addition, the prior probabihty on the true model J^So is 
bounded away from zero, i.e. f{Aiso) > 0- 

Assumption 5: We assume Xt2 < C)0. 

Assumption 6: The condition 

< iy-f{XsX^ + \I)-Y 



(y")V 

Amin(-^5o-^5 



'SqJ 

holds for allneN+, S eS\SoandO<X< cx). 

Assumption 4 states that the true model should always have positive mass under the 
prior. It is a reasonable assumption since otherwise by Bayes' theorem the posterior 
probability of M.So ^i^ be zero. Under Assumption 4, for any two arbitrary index 
sets S' and 5"', the quantity \S'\ — \S"\ is bounded away from infinity and minus 
infinity. In addition. Assumption 6 is a technical condition which ensures that the 
ratio {Xj^^Xso + XI\So\)/{XjXs + Alls'!) is smaller than 1. The assumption will be 
useful in proving convergence of the the Bayes factor BF(A^s, J^So'i U"')- 

Theorem 5.2. Given Assumptions 4 to 6 hold, there exist some constants c^ > 0, 
e > and n* > such that 

(Tit 
-y 

for all n > n* . 

The proof of Theorem 15.21 is given in Appendix D. 

5.3 Estimation consistency 

Denote (3q the coefficient vector corresponding to the true model Aiso- Define the 
expected I2 distance between estimator /3 and the true coefficient /3o by 

Ey[P-/3o||^]= [\0-mlpiy\Mdy, (5.4) 
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where p{y\ f3o) is the samphng density parametrized by the true parameter (3q. Using 
(15.11) . the BAVA-MIO estimator can be defined by ^bmio = /3'^^^°. We will derive an 
asymptotic bound for expected I2 distance between /3bmio and f3o and show that the 
expected I2 distance converges to as n —t- 00. Before deriving the asymptotic result, 
we make some finite moment assumptions on the true parameters (/3o, o"^) under the 
posterior probability. 

Assumption 7: There exist some constants C4 > and C5 > such that 
E[/3qj| y"-] < C4 for j = 1, 2, ■ ■ ■ ,p, and cr^ < C5 for all positive integer n. 

Assumption 7 ensures that the first and second moments of parameters /3o and a^ 
are bounded away from infinity as the sample size n goes large. 

Theorem 5.3. Given Assumptions 1, 3 and 1 hold, if there exist constants a > 
and e„ > such that logne„/(e„n°) — )■ 00 as n ^ 00, then 

y" ) < C7exp(-e„n") 



P(Ey[||/3BMIO-/3o|l2]>er 

for some constant C7 > 0. 



The proof of Theorem 15.31 is given in Appendix E. 

6 An extension to Generalized Linear Models 

Here we extend the proposed method, the BAVA-MIO, to parameter estimation in 
the Generalized Linear Models. Consider the density of the exponential family 

fiy\ e, ^) = exp |^^_M + d{y, ^)|, (6.1) 

where ^ is a parameter characterizing mean of the distribution, and y? is a parameter 
characterizing dispersion of the distribution. Under the exponential family (16. ip . 
variable Y has properties such that E{Y\ 9, if) = b'{9), Var{Y\ 9,ip) = h"{9)ip. De- 
note V = b'{9). For a Generalized Linear Model, there exists a link function 77 such 
that rjlu) = x'^p. The link function gives a fiexible connection between the mean 
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function v and the predictor x^/3, therefore a vahd regression can be formulated un- 
der this parametrization. In addition, by definition, v is parametrized by 6', therefore 
for notation convenience, we may write Q = 6{x^P). 

For a practical inference concern, we will not assign a prior on (p in the following 
Bayesian hierarchical formulation. We only assign priors on regression coefficients 
P and covariate indices 7. The inference concern arises from the fact that the es- 
timation of if is dependent on the function d{y,ip), and in general, d{y,ip) is case 
dependent. We will launch an investigation on how to assign a prior on ip in the 
future, but at present we only focus on inference based on priors on /3 and 7. Now 
consider the following logarithm of the joint density function 

-^ogf{/3,-f\ X,y,(p,X,K) = -2^< + d{yi,(p) 

^ i=i 



+ 2 E 7. log I ^^^^, I + const. 
.7=1 ^ '^ 



(6.2) 



The first term in ( 16. 2p is the logarithm of joint sampling density over z = l,2,---,n, 
and the second and third terms are the logarithm of the priors on /3 and covariate 
indices 7, respectively. To modify (16. 2p for BAVA-MIO estimation, we first multiply 
(16.21) with if. We then apply a majorization-minimization techinque to obtain an 
approximation to the Iq norm penalty. The BAVA-MIO estimator of /3 is defined as 
the minimizer of the approximated objective, which can be obtained by the following 
iteration scheme 

^("+1) = argminj X;y,0,(a;f/3) - 6[^.(a;f/3)] + hp\\l + p0^^P\\X (6.3) 

where p = ^[log27r(^(l - k)2(Ak2)-1]/2, and 0(™) = lim^3_,o[log(l + T^^)iW^"'^\ + 
Ta)]^^. Further by differentiating (16. 3p with respect to /3, and setting the derivatives 
to zero, we obtain the subgradient equations of /3, which are given by 

- X^Wr + X/3 + sign(/3)p0(") = 0, (6.4) 
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where r = {y-v)i{v), W = di^g{[i{v,f]h"{e,), [i {v2f]h" [O^] , ■■■, [r]'{unf]b"{en))-\ 
and u = (z/i,z/2,--- , t'™) with z/j = h'{9i). The term X^Wr in (16 ■4p is a standard 
result in parameter estimation of the Generahzed Linear Models, and its derivation 

n 

can be found in 29|. The term X^Wr allows us to formulate an iteration scheme 
to approximate the solution of the subgradient equations (16 ■4p . Here we will use the 
following iteration scheme 

(^*)(-+i)=argmin|i||t/«(^M-X/3)||^ + ^||/3||^ + p||0('")/3||,|, (6.5) 

where 

^(m) _ ^(m) _|_ Am) 
f/^"') = (H^V2-\(m) 

to approximates the solution of the subgradient equations (16.41) . The jth element 
of the iteration scheme (16. 5p can be obtained by further carrying out the following 
soft-thresholding operator coordinatewisely: 

(/3;)('+i) = (X:^;?4 + A)"5T(x:x.,^.;?<l„p0f), (6.6) 

where w\^ is the zth diagonal term of W'^^\ vl^_,j = zl — ^j'^j^ijPf with /3*, = 
0*,Y'^'^ for j' = 1, 2, ■ ■ ■ , J - 1 and P*, = 0*,)^'^ for f = j + 1, j + 2, ■ ■ ■ ,p, and 
ST{a, b) is the soft-thresholding operator defined by ST{a, h) = sign(a)(|a| — 6) + . 

We now conduct a simulation study to assess the performance of the BAVA-MIO 
estimation. We take logistic regression as the example. For the true model, we as- 
sume Yi ~ Bernoulli (Ci), where Q is parametrized in terms of predictor xjf3 via the 
link function log[(Ci)/(l — d)]- We further let the number of covariates p = 120. For 
the regression coefficients (3, we generate the jth entry /3j from a standard normal 
distribution for j = 1, 2 ■ ■ ■ ,10, and let the rest of 110 (3jS equal to zero. We simulate 
the covariate vector Xi i.i.d. from MVN(0,Sx). We consider three Ex's, the same 
as those described in Section 4.1, to generate the covariate vector. With (3 and Xj, 
we simulate Yi from Bernoulli(Ci) for the cases of n = 100 and n = 200. With three 
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different values for Ex and two different values for n, we have six scenarios in the 
simulation study. For each scenario, we generate 100 simulation runs. In each simu- 
lation run, we apply the BAVA-MIO estimation to fit a logistic regression model. We 
let hyperparameter A = ra"^/^ for all estimations. Note that for a Bernoulli variable, 
a closed form representation for the Bayes factor does not exist, therefore we only use 
ten fold cross validation for tuning parameter selection. For comparison purposes, 
we also carry out the Lasso estimation using R package "glmnet", with ten fold cross 
validation for tuning parameter selection. We collect four performance measures, the 
same as those described in Section 4.1, at each simulation run. Average values of 
the five performance measures over the 100 simulation runs are given in Table [2] to 
Table HI From these tables we can see that the BAVA-MIO estimation in general 
has slightly larger values in PMSE than the Lasso estimation has, but it gives far 
fewer number of selected covariates, more accurate results in covariate selection, and 
in some circumstances, better parameter estimation than the Lasso estimation does. 

7 Real data examples 

In this section we present two real data analyses. The first one analyzes the Diabetes 
data. Here a linear regression model describing relationships between the disease 
progression and 10 relevant variables will be fitted. The second one analyze Golub's 
Leukemia gene expression data. The data set consists of 72 Leukemia patients and 
is considered as one of the benchmark data sets in high dimensional classification 
problems. We will apply BAVA-MIO estimation to fit logistic regression models in 
identifying disease types for the patients. 

7.1 Diabetes data 

The Diabetes data contains a measure on disease progression and 10 covariates: age, 
sex, the BMI index, blood pressure and six related variables for 442 diabetes patients. 
In our analysis, each covariate has been rescaled to have mean zero and variance 1, 
and the response variable has been centered around its mean. All estimations are 
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based on the rescaled covariates and centered response variable. For hyper param- 
eters, we let {ti,T2) = (1, 1) and A = 0.2 x ^yp\og{p)/n ^ 0.049. We perform two 
BAVA-MIO estimations. The first one uses the Bayes factor (BMIO-BF) while the 
second one uses ten fold cross validation (BMIO-CV) for tuning parameter selection. 
The results are shown in the first two columns of Table O From the results we can 
see the BAVA-MIO estimation using the Bayes factor leads to a covariate selection 
sparser than its counterpart using ten fold cross validation. We also run another 
100 estimations based on sampling half of the 442 subjects without replacement to 
calculate the inclusion probabilities for the 10 covariates. For each covariate, the in- 
clusion probability is defined as the proportion of occurrences of non-zero estimated 
values appearing in the 100 subsampling estimations. We compare the results from 
the BAVA-MIO estimations with the results from three other estimation approaches: 
g-prior, hyper-g, and BIG. All the three estimations are carried out using R pack- 
age "BAS", which is developed by Clyde, Ghosh and Liftman [6] as the companion 



software to the paper of Liang et al. 27|. These results are shown in the last five 



columns of Table [51 For the three estimations using the BAS package, we report the 
models estimated with the highest marginalized likelihood. Again, the results show 
that the BAVA-MIO using the Bayes factor has the sparest covariate selection among 
the five proposed approaches. Among the 10 inclusion probabilities estimated via the 
BMIO-BF estimation, only two are above 0.5, compared to five for the BMIO-GV 
estimation, six for the g-prior and the BIG estimations, and seven for the hyper-g 
estimation. 

7.2 Golub's Leukemia data 

The Leukemia gene expression data, adopted from R package "golubEsets", is orig- 



inally from [18(1 . It consists of gene expression profiles for 72 Leukemia patients, of 
which 47 are diagnosed with acute lymphoblastic leukemia (ALL) and 25 are diag- 
nosed with acute myeloid leukemia (AML). Each profile has 7129 gene expression 
values measured by Affymetrix Hgu6800 chips. The data set is further divided into 
the training set, which consists of 27 ALL patients and 11 AML patients, and test 
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set, which consists of 20 ALL patients and 14 AML patients. Our aim is to identify a 
patient's disease type with a small set of genes. The data set is processed as follows. 
The disease type is labeled with for the acute lymphoblastic leukemia and 1 for the 
acute myeloid leukemia. Each covariate is first rescaled to have a range greater than 
or equal to zero. Then it is under a suitable logarithm transform before rescaled again 
to have mean and variance 1. For constructing the classification rule, we apply the 
BAVA-MIO estimation to fit logistic regression models with the training data. We 
parametrize hyperparameter A = rfypXogp/n and perform three estimations with 
7] = 0.05, 0.1 and 0.5. The tuning parameter is selected via five fold cross validation 
and the resulting estimates are termed BMIO-CV I, BMIO-CV II and BMIO-CV III, 
respectively. With estimated regression coefficients, we calculate the label probabil- 
ity for each patient, and classify those with label probabilities smaller than 0.5 to 
the acute lymphoblastic leukemia group, and those with label probabilities greater 
than 0.5 to the acute myeloid leukemia group. The estimated label probabilities 
for the total 72 patients under the three BAVA-MIO estimations are shown in Fig- 
ure [HI The corresponding classification results are reported in Table [6l along with 



classification results on the same data set done by Golub et al. 181 and four other 



estimation approaches [47|,[3J],[8|,and [lO| aiming to tackle high-dimensional classi- 
fication problems . The results show that BAVA-MIO-based classification rules tend 
to use less numbers of genes in identifying a patient's disease type. However, even 
with smaller numbers of genes, the BAVA-MIO-based classification rules can still 
generate results comparable with those provided by other benchmark methods. 

8 Concluding remarks 

In this paper we have provided a novel approach, the BAVA-MIO, for MAP estima- 
tion in Bayesian regression models in which coefficients are assigned with Gaussian- 
based spike and slab priors weighted by Bernoulli variables. Simulation studies show 
that the resulting estimates can lead to better performances in variable selection in 
various circumstances, even in severe situations in which the Irrepresentable Condi- 
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tion is violated. Asymptotic results stated in Section 5 further support the simulation 
studies. The proposed method is then extended to parameter estimation in the Gen- 
eralized Linear Models. One real data example shows that logistic regression with 
BAVA-MIO estimates can use less covariates to achieve classification results that are 
comparable to those provided by other benchmark methods. 

One important issue we did not pay much attention in this paper is the impacts 
of hyperparameters on estimation results. Here we provide some possible modifi- 
cations in addressing this issue. First, an equally spaced grid may be constructed 
for hyperparameter A so that the estimation procedure can be carried out along the 
grids on A and "${1^). Another possible modification is to drop the prior assumption 
on o"^ and treat it as a constant. In this approach the impact of cr'^ on parameter 
estimation can be dealed together with the tuning parameter \I'(/t;). This approach 
has been adopted in Section 6 for parameter estimation in the Generalized Linear 
Models. 
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Appendix A: Derivation of the Soft-thresholding 

Operator Representation 

In this paper, the soft-thresholding operator f l3.16p is used to build a coordinate 
descent algorithm for obtaining the minimizer of the objective function fl3.15p . Gen- 
erally speaking, a coordinate algroithm is an iteration procedure aiming to minimize 
an objective function coordinatewisely. Here the word " coordinatewisely" means that 
at each iteration only one coordinate of the minimizer is considered for optimization 
given that all other coordinates are fixed. Here we focus on the jth coordinate and 
derive an explicit form for the soft-thresholding operator (13.161) . For simplicity, we 
drop the index mi in fl3.16p and define p = P\,n^a'^<t>- Define (3^ = f3j if f3j > and 
/3+ = a (3j <0. Further define (3' = /3j if (3j < and (3r = ii (3j > 0. With 
the definitions given above, we can write \f3j\ = (3^ — Pj and Pj = 13^ + I3j . Since 
we only focus on the jth coordinate, we rewrite the squared I2 loss \\y — XPW^ as 
Sr=i(^i,-i ~ ^ijPjY for notation convenience, where fj_j = i/i — J2j'^j^ij'l^j' ^^^ 
/3j/'s are fixed constants. Note that given all other coordinates are fixed, the problem 
of minimizing the objective function fl3.15p with respect to /3j is equivalent to the 
following constrained optimization problem: 

minimize ^ ( fj _j - Xij{(3j ^ l^j) ) 

subject to (/3+ + Pjf < h, {/3+ - Pj) < t2, -13+ < 0, (3j < 0. (A.l) 



The Lagrangian associated to problem (lA.ip is given by 



L{(3t,(3i ,\,p,pi,p2) 






+A[(/3+ + pry - ii] + p(/3+ - p- - t2) + pi(-/3+) + P2/3,- 
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The KKT conditions associated to problem f lA.l|) are given by 



(/3+ + /3-)2-ti<0, A[(/3+ + /3-)2-ti] = 

Pt-(3--h<0, p{l3l-(3j-h) = 

-/3;<0, Pi(-/3+) = 

/3-<0, P2/37 = 

A, p, Pi, P2 > 

n n 

+ 2,3+^4 + 2A;3+ + p- Pi = 



■2E 

n 

-2E 



'^zj 'if—j 



n 



'^ij 'if—j 



+ 2/37E4 + 2A/3--P + 



P2 



0, 



(A.2) 



i=l 



For the third hne in the KKT conditions (lA.2p . the complementary slackness con- 
dition further implies that if /3^ > then pi = 0, and if pi > then /3^ = 0. A 



similar argument can be applied to the fourth line in flA.2l) . With the arguments 
given above, the sixth and seventh lines in flA.2p jointly imply that 



E 



•^ij'i,—j 



< 



(A.3) 



if Pj = 0, and otherwise if (3j ^ 0. Now with condition ( JA.31) and the sixth line in 
(1A.2I) . we can derive a closed form solution for 13 j~, which is given by 

-1 / n 



E4 + M E 



•^ij^i,—j 



(A.4) 



A similar argument can be applied to derive a closed form solution for /3 , which is 
given by 



:4 + A 



-1 / n 



E 



/^7 = (E 

Combining (IA.4P and (lA.Sp . we get 



p 



(A.5) 



j=i 



«=i 



•"^ii'^^i-ji 2 i ' 



(A.6) 



where ST{a, b) is a soft-thresholding operator defined by ST{a, h) = sign(a)(|a| — 6). 



37 



Appendix B: Proofs of Proposition [3I2] and Propo- 



sition 13.3 



Proof of Proposition \3.2i Note that 

log(l + r3 ) 
log(l + T-'\(3,\) - /(/3, ^ 0) log(l + T^') 

l0g{l + T,-') 

log(r3 + 1/3,1) - /(/3, ^ 0) log(r3 + 1) + (/(/3, ^^ 0) - 1) logJTs) 
- log(r3) + log(r3 + 1) 

1 + ^3-^1/3,1) _ ... , .. log(r3) - log(r3 



For the case of /3j = 
log 

log^x-r -3 

For the case of /3j 7^ 0, 



log(l + r3 ) 



-log(r3) + log(r3 + l) 
log(r3+ 1/3,1) -log(r3 + l) 



0. 



(B.l) 



(B.2) 



- log(r3) + log(r3 + 1) 
The numerator in ( ]B.2p is bounded from below and from above with /3j 7^ and 
T^ G [0, 1), therefore there exist two positive constants C2, C3 G M such that 

- 00 < -C2 < log(r3 + |/3j|) - log(r3 + 1) < C3 < 00, 

for J = 1, 2, ■ ■ ■ ,p, and we can bound the numerator in a way that 



log(r3 + 1/3,1) -log(r3 + 1)1 <Ci, 



(B.3) 



where Ci = C2 V C3. Further note that for r-^ G [0, 1), the denominator in (IB.2P 
is always greater than zero. Therefore by using (IB.ip . (IB.3P and the fact that the 
denominator in (IB.2P is a positive constant, we can bound \\gi{(5;T-i) — 5'2(/3)||i in a 
way such that 



|^i(/3;r3)-^2(/3)|h 



p 



< Ci 



log(l + r3-^) ^''^" 


<y 


1 


^l 


-logr3 + log(r3 + 1) 


p ^ 



0) 



log Tg ^ 
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which completes the proof. 



D 



Proof of Proposition 1 5*. 31 By direct calculation, we have 



9iil3j;r3) 



lim 



^ X ii^-log(^3)+log(r3 + |/3,|) 



On the other hand, 



lim 



m 



1/3,1^ .3^0 [-log(r3)+log(r3 + l)] |/3,| 






X lim 

r3— >oo 



^3 log 






logexp(|/3j|) 
logexp(l) 



xl = |/3,|-' (B.4) 



T-slog 



T-3+1 
^3 



which completes the proof. 
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Appendix C: Proof of Theorem 15.1 



We will use the following notations in the proof. For two vector a = (ai, 02, • • • , Op) 
and b = (&i,&2, ■ ■ ■ ,bp), the notation \a\ < \b\ means pairwise inequalities hold for 



elements in a and b, i.e. |a,| < \bj\ for j = 1,2, 



,p. Similar operations are ap- 



plicable to \a\ > \b\, \a\ > \b\, \a\ < \b\ and the function max(a, 6). In addition, we 
denote 1*. the p*-dimensional vector with entries all equal to 1. 



Proof of Theorem I5.il Define w"^^ = P'^'^ — (5. The sign consistency implies that 



for /3j > 0, wf 



Pf 



13 j > -jSj] for 13 j < 0, w^ = (3^ - l3j < -I3j; and for l3j 



0, 



wY = P, 



T"3 



/3, = 0. In addition, it can be shown that w'^^ is the minimizer of the 



following function 



L(w) 



Xwlli + X\\w + 



^^\og{l + r,'\w + P,\) 



i=i 



log(l + T^') 
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which means that u;'^^ is the solution for the subgradient equations given by 







2 1 ^"^ I + 2A 



P^' 



So 



M 



+ 



sign{f3g)/iTs + \Pg\) 



P 



sign(/3^|)/(r3 + |/35|l)/ ^1 + ^3 
Then following conditions are necessary and sufficient for the event Eq^^-,^: 

P ■ sign{/3so) 



El 
E3 



P 



2{Ts + \w-'+Pso\)^ogil + T, 



-1\ 



P: 



2r3log(l + r3-i) l^"l - s So s s - 



P 



2X3 log(l + r,-i) ^1^51 



We will restrict our discussion on the following case 



Eo 



fi : \w^'\ < \Pso\ 



(C.l) 



(C.2) 



Note that if flC.2p holds, then /3J^ 7^ also holds, and the above sign consistency 



arguments on Wg^ will also hold. Technically we express |/3^ | 



?"=* I - 5 and given flC:2|) 
6 > 0. Since E2 is a restricted event, we may write Ei n E2 H E^ C Eq^.^. We first 



solve the equations in Ei to obtain a representation for Wg^ in terms of Csso and 
Dso- The representation is given by 

P ■ sign(/3sj 



w 



■T3 



n ^^^4 I n 



1/2^ 



So 



_-l^ 



A/350 . 



(C.3) 



2(r3 + 5)log(l + r3 
Then by plugging (]C.3|) in the left hand side of E2, we obtain the following inequality: 



n-'/'C^LD,. 



'SSo-^So 



'^ ^SSo 



P ■ sign(/3, 



So J 



2{T, + 6)\og{l + T.i'] 



A/3, 



So 



< n-'^^\C^}soDso\ + £ 



^-1 / sign(/3g, 

^^° ^ V3 + 6) 10g(l + T; 



2A^ 

3-) ^ 7^- 



(C.4) 



If the right hand side of flC.4p is smaller than \Pso\j then E2 will hold. Denote the 
event by E2. Equivalently, E2 can be written as 



^2= /3:|C^iI^5ol<^'/'l/35o|- 



P 



2nV2 



^-1 



55o 



sign (A 



So J 



:r3 + 5)log(l + r3 



2A^ 

+ -/3.0 
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Obviously, E'2 (^ EiH E2. On the other hand, by plugging fIC.SP in the middle term 
of E3 and taking absolute value on it, we obtain the following inequality: 



n^ CscsqCss^Dso 



1 , p ■ sign( /3, 

^.=.oC^.5„l^2(r3 + 5)log 






< n 



1/2 



Cs'=5oCs5o^5o - Ds- 



+ 



Cs'=SoC's5o 



P ■ sign(/3, 



•Soy 



nv + M 



'So 



(C.5) 



,2(r3 + 5)log(l + r3- 

If the right hand side of (]C.5|) is smaller than p[2r3log(l + T.^^)]^^ltgci, then E^ will 
hold. Denote the event by E'^. Equivalently, E'^ can be written as 



E' 



(3 



< 



P 



2r3log(l + r3-i 



■|So1 



ni/2 

1 f r-i ■ sign(/3gj 2r3log(l + t^^)X 



^''\ {rs + 6) 



P 



(C.6) 



Obviously E'^ C Ei n E3. Now by using the fact that F{Ei n E2 n E-^] y") > 
P(E^nE^| y"), we have 

= l-F{E'^'UE'/\y-) 

Here £'2'^ and E'g'^ are the complements of -E'2 ^^*i -^35 respectively. To bound the 
probability P(-E2''^| y") + P(-E3''^| y"), first note that by Assumption 2, we have 



!£> 



•So 1 1 00 



< WD 



SoWl 



n 



i/2||X^e|| < n-i/2||x^e||i < 00. Then \\C^' D 

I I on I I — II I I -L II r)r>n 



SSq-'^SoWoo 



< 



^mm{Csso) ^\\Dso\\oo < oo by Assumption 1. In addition, by Assumption 3, the 
second term in the right hand side of the inequality in E2 will goes to as ra — )■ 00. 



Therefore for E^'^, we have 



2 5 



¥{E'^'\y^)<¥(^\\CsiDs,\\oo 



> n 



1/2 



max |/3j 
j 



^0 



(C.7) 
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as n — 7- oo. On the other hand, for the term in the left hand side of the ineuqahty 
in E'^, we have 

1 1 Cs'^So ^SSq ^So " Ds^ 1 1 oo < II Cs'^So Dsq 1 1 oo Amin {CsSq ) + 1 1 Ds^ 1 1 oo 

< r,-^\B II Arnax(Xj^X5j 

(C.8) 



which is also bounded from above by Assumption 1 and Assumption 2. Further note 
that since ra oc n^^ , therefore for ts = c^n"^ with < cg < oo we have 

r3log(l + r3-')n'/'= '°g'^+;f-'" ^0 (C.9) 

as n — > OO. In addition, the last term in the right hand side of the inequality in E'.^ 
also approaches to as n — )■ oo under the assumption that r-^ oc n^^. This event 
guarantees that the quantity in the right hand side of the inequality in E'^ will remain 
non-negative as n — )■ oo. Therefore by using fIC.Sp . we can bound the probability of 
E'r by 



nE'^\y-) < p(^||D5g| 



AmaxlAT^^Xso) , iin II ^ P 



2/" 



'A^UXlXso) + X " "°"°° 2r3log(l + r3-i)nV2 

(CIO) 

which approaches to as n — ;■ oo by (lC.9p . By (IC.7P and (IC.lOp . we have F{E2'^\ ?/") + 
P(-E'3'^| y"-) — !■ as n — > oo, therefore P(£'o,t3| y") — )■ 1 as n — > oo, which completes 
the proof. D 

The following proposition summarizes the invariance of the BAVA-MIO estimator 
under the Irrepresentable Condition. 

Corollary C.l. Assume that 

r c ■0< ^ r^-i /'sign(/3so 



^'''^^''^y'KT6) 



<l|5g|-00, (C.ll) 

and 6 > and t^ oc n^^. Then given Assumptions 1 and 3 hold, for estimator defined 
in 115. 1\) . the inequality stated in E'^ will hold under the following condition: 

l\ss\ < \Cs^SoCsisign{(3s,)\ < If^ej • oo. (C.12) 
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Proof of Corollary \C.1[ We start the proof by defining 

ICs-SoCsSo^ig'^it^So] 



Ctl 



02 



max I 1 



max I 1. — : — ; — — 



|Cs-5oCs5oSign(/3so)l 

Given Assumption 1, we have ai < cxd, and given (IC.12p . we have a2 < oc. The 
second term in the right hand side of the inequahty in E'.^ can be bounded from below 
in a way such that 
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(C.13) 



The final term in the right hand side of (IC.13P approaches to zero as n — )■ oo given 
that T oc n~^ and Assumption 3 holds. The first two terms in the right hand side of 
(1C.13|) can be bounded in a way such that 
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(C.14) 



Given that T3 oc n~^, the second term in the right hand side of ( JC.14P approaches 
to If^ci • as n — !■ 00, which implies that E'^ can still hold under (]C.12p . i.e. the 
Irrepresentable Condition is violated. D 



Appendix D: Proof of Theorem 15.2 



We first prove the following Lemma. 
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Lemma D.l. Assume S ^ Sq. Then given Assumption 4 to 6 hold, there exist some 
positive constants n* and e such that 



logp(y"| Mso) - logp(?/'^| Ms) + logp{Mso) - ^ogp{Ms) > ne. 



for n > n* 



Proof of Lemma \D.l\ For the Bayesian hierarchical model stated in (13. ip . the Bayes' 
factor between Jvls and A^So is given by 



BY{Ms^Ms,;y' 



Piy^'l 7o,Ti,r2,A) 



A<J 



l\^\XlXs, + XIiSo\\'^'r{in+\S\-l + 2n)/2) 



TT/ |XjX5 + A/|5||i/2 r((n+|5o|-l + 2ri)/2) 
m^Xs^Xl + Xl^yV + 2Ar2] K"+l^l-^^-^+^-^)/^] 
[{y-V{XsX^ + XQ-V + 2Xr,] I("+I^l-+-^)/^] 

. .rr, rr N -1 n [(n+l'S|-l+2ri)/2l 

{y-f{Xs,Xl + Xh) V + 2Ar2 

_(y")^(XsXj + A4)-'r + 2Ar2_ 
xi^(y",7o,7,ri,r2,A), (D.l) 



where Ag = 15*1 — l^o], and iJ(n; 7,70, ri, r2, A) is some function that collects the 
remainder terms. It can be shown that under Assumption 2, Ag is bounded, and 
further by Assumption 3, it can be shown that if (y'^,7,70, ri,r2, A) is bounded for 
all n. By Assumption 4, we can bound the term (|/")-^(X5(,Xj^ + A/„) y" from 
above in a way such that 
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which further implies that 
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(D.3) 
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The inequality f lD.3|l implies that the logarithm of Bayes's factor flD.l|) is bounded 
away from above for all n. Define 

2 °^ l^ (^n)T ^XsXl + A/„) '\- + 2Ar2 
7^2 = (|5|-l + ri);ri + logi7(y",7o,7,ri,r2,A). 



We can express the logarithm of Bayes' factor ( 1D.1I) in terms of i^i and K2 as 



logBF(7W5,-M5o;2/") = logp(l/"l-M5)-logp(i/1-M5o) 

= ni^i + iTs. (D.4) 



Since i^i < and i^2 is bounded, (ID.4p will converge to —00 as n — )■ 00. From (ID.4p 
we can see that 

logp(?/"| MsJ - logp(2/"| A^s) + logp(XsJ - \ogp{Ms) 
PiMso) 



-riKi - K2 + log 
n( -Ki- 



p{Ms) 
K2 - logp(.M5j + logp(.Mg) 
n 

Denote K3 = K2 — \ogp{Mso) + ^ogp{Ms)- Note that —Ki > 0, therefore, for 
K^ < 0, we let ei such that < ei < — -ft'i- For K^, > 0, we choose some N > n* so 
that — -ft'i — K3/N > —Ki — K^/n* = 62 > 0. Let e = ei A £2 and n* satisfying the 
condition given above, we complete the proof. D 

Proof of Theorem \5.S[ Note that the posterior probability P(A^5(,| y") can be ex- 
pressed in terms of Bayes' factors as 



The first step for proving the theorem is to bound the tail probability fID.SP from 



below. We focus on deriving an upper bound for the second term in the right hand 
side of (]D.5|) . One trick to derive this upper bound is to derive a lower bound for 



the denominator and an upper bound for the numerator. Define 

ne 



S^ = ^S:\ogBFiMs,Mso;yn>-^ 
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Now for the denominator, we have 

J2^FiMs,Ms,;ynpiMs) 
s&s 

= J2^^(^s,Mso;ynpiMs)+J2^^(^s,Mso;ylpiMs) 

seSi sesi 

^ I'^il E T^exp[\ogBF{Ms,Ms,;yn + ^ogp{Ms)]. (D.6) 

ses, I'^il 



Since flD.6P is a sum of convex functions, a lower bound can be derived in a way such 
that 



I'^il E T^exp[\ogBF{Ms,Mso;yn + ^ogp{Ms)] 
> |5i|exp 1^ Yl ^ogBFiMs,Ms,;yn + TJ]Y1 ^ogpiMs)] 

*- ' ^' se<Si ^^^ ' ^^ ' ^' seSi ^ 

Now we turn to derive an upper bound for the numerator. Define 



5 = arg max < \ogBF {Ms', Ms,,] y"") +^ogp{M s') 

S'£S\So { 

An upper bound for the numerator can be derived in a way that 
J2 exp I logBFiMs'^Mso^y"") + ^ogp{Ms') 



< 7F max exp <^ logBF(A^5/,A^5„;|/") + logp(A^5/ 

S'e5\5o 



< 2^exp max {\ogBF{Ms',Ms,,\y'') ^\ogp{Ms' 

\ 5'e5\So [ 

= 2Pexp|logBF(A^^,A^5o;i/")+logp(A^5)|. (D.8) 



Further define 



2^ f 1 1 

C3 = ^exp -^ V \ogp{Ms)\. (D.9) 



Here |iSi| is the number of elements in S\. Now with flD.7p . flD.81) . flD.9P and the 



result in Lemma 1, the tail probability in the right hand side of fID.Sp can be bounded 
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from below in a way such that 

Eg.^g, explhgBFjMs', Msq-, y") + logpjMs')] 
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> 1 - C3 exp I - y 1 , 
which completes the proof. 
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Appendix E: Proof of Theorem 15.3 



Proof of Theorem 15.31 First define the ridge estimator /^''''^se ]^j 



/3"'^^^ = argmin ||y-X/3||^ + A||/3||; 



Note that the ridge estimator has a closed form solution that /3''''^se _ (^x'^X + 
XIp)^^X^y. The definitions of /3^^ and /J"'^^'^ imply that 



\y-Xl3^-^\\l + X\\/3^ 



||^_j5^^ridge||2_^pridge||2 

{P^^)^{X^X + XIp)P^-' - [P'^'^^YiX^X + A/p)/3"'^s^ 



The last two terms in the right hand side of f lE.ip can be rearranged as 
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Therefore 
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:e||2 



^^rs _ ^ridge^T^^T^ ^ A/p)(r=^ - P'"^^') , 
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and in turn, 

(Amin(X^X) + A) 1 1/3"=^ - P"'^^^ < 0^^ - /3"'is^)^(X^X + XIp) 0^-' - ^"^*5^) 

= \\y-Xl3^^\\l + X\\r'\\l 

by definition of (3'^^, wfiere pi = p[log(l + Tg"^)]"^- Rearranging (IE.2P we liave 

Denote 5* tlie estimated index set of /3J^ 7^ 0. By using tlie inequality log 6' < ^ — 1 
for 6' > 0, we can bound the right hand side of flE.Sp in a way such that 



^^(r3+|^-|)log(l + r3-i) ^J^, log(l + r3-^) 

(E.4) 

Note that the second term in the right hand side of flE.4l) approaches to p\S^\ as 
Ts — > with /3" ^'^ 7^ for j G S''^. In addition, the term 

1 



(r3 + |^-3|)log(l + r3-i; 



^0 



as Ts — 7- given that /3J^ 7^ for j G S*. Therefore 

" n + r^'0''''^'^ 

By using the inequality ( IE. 51) with rs — )■ and 15*^1 < p, we have 



limpiVlog ^ i < p|^% (E.5) 



In addition, as shown in the Theorem 1 of Zou and Zhang 491], the quantity ||/3"'^^'^ — 
/^olll can be bounded by 

"^ ~^°"^- (A..(X-X) + A)^ (^-'^ 
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By definition, /3bmio = /3'^^^°. By using the results in flE.SP and f lE.7p and Assump- 
tion 1, we can bound the quantity IEy[||/3BMio — /^oHi] i^ ^ way such that 



Ey[||/3BMIO-/3o||2] < 2Ey[||/3BMIO 



+ Ey[P"S<^^-/3o| 
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< 



< 



2pp{A^UX^X) + X) , 4A2||/3o||i + 4Mmax(X'^X)a2 
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(nci + A)2 

2ppc2 + 4n-U2||/Jo||^ + 4pc2a^ 

nc? 



(E.8) 



Further by using Markov's inequahty, the probabihty P(Ey [| |/9bmio — /3o| I2] > ^n\ y") 
can be bounded by 

^ ^f 2ppc2 + 4:n-^X^\\(3o\\l + Apc2cr^ ^ ^ 



F Ey[||/3BMio-/3o||2] >en 
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racf 
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2ppc2 + 4n X pc4 + 4pc2C5 



nc?e„ 



(E.9) 



Let C7 = p{2pc2 + 4A2c4 + 4c2C5)/c^. Then with assumption that there exists a 
constant a > such that lognen/(e„n") — t- 00 as n — t- 00, ( IE. 91) can be bounded in 
a way that 



P(Ey[||A3MIO-/3o||2] >en 



y" <C7exp(-e„n") 



which completes the proof. 
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_0 penalty and its approximations 



_0 + squared i_2 penaity and its approximations 
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Figure 1: The penalty functions and related approximations. 
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Path for beta (n = 100, p= 1000) 



Path for beta (n = 100, p= 1000) 
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Figure 2: Estimation results for the toy example. 
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Figure 3: Simulation results given SNR = 1. Top: Model 1 (covariance matrix with 
off-diagonal terms equal to 0); Middle: Model 2 (covariance matrix with off-diagonal 
terms equal to 0.5); Bottom: Model 3 (covariance matrix with off-diagonal terms 
following a Toepolitz structure). First column: standardized /2-distance btween es- 
timated and true values; Second column: sign function- adjusted false positive rate; 
Thrid column: prediction mean squared error; Fourth column: number of non-zero 
estimates. 
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^0,SNR = 3.16 
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Figure 4: Simulation results given SNR = 3.16. Top: Model 1 (covariance ma- 
trix with off-diagonal terms equal to 0); Middle: Model 2 (covariance matrix with 
off-diagonal terms equal to 0.5); Bottom: Model 3 (covariance matrix with off- 
diagonal terms following a Toepolitz structure). First column: standardized I2- 
distance btween estimated and true values; Second column: sign function-adjusted 
false positive rate; Thrid column: prediction mean squared error; Fourth column: 
number of non-zero estimates. 
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Figure 5: Simulation results given SNR = 7.07. Top: Model 1 (covariance ma- 
trix with off-diagonal terms equal to 0); Middle: Model 2 (covariance matrix with 
off-diagonal terms equal to 0.5); Bottom: Model 3 (covariance matrix with off- 
diagonal terms following a Toepolitz structure). First column: standardized I2- 
distance btween estimated and true values; Second column: sign function-adjusted 
false positive rate; Thrid column: prediction mean squared error; Fourth column: 
number of non-zero estimates. 
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Figure 6: Heapmaps based on GAP (the generalized association plots) for the S-FPR 
(left) and rankings of the S-FPR (right) of the five estimation approaches under the 
45 simulation scenarios. 
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Figure 7: Scatter plots for the sign probability F{S = So) against the Irrepresentable 
Statistic under different signal-to-noise ratios. 



56 



CV error 

- - TBSt error 

Smallest CV error 




iJ 




\ 




; V 



0.0 0.2 



log(#otgenes + 1) 

Smallest CV error 




y 


A 


y 









CV error 

- - Test error 

Smallasl CV error 














; V 


\ 


y 




\r\ 


fw^\V 



CV error 

- - Test error 

Smallest CV error 












;\ 


1 - -'''' 



0.0 0.2 



log(# of genes + 1) 

Smallasl CV error 


/ 


r 


y 







log(# of genes + 1) 

Smallest CV error 


j 


J" 


/ 







0.0 0.2 



0.0 0.2 



+ +-H- + + 
+ 
+ 
+ 






++ + 
+ + 


+ 






+ 


'^+ 






A + 




' 




- 


^ 


' 


,' 


A 






% 




^ 




a ALL (train ing) 
+ AML (training) 


.. - 











20 30 40 50 60 70 

Sample index 





Figure 8: Results of the BAVA-MIO estimation for Golub's gene expression data. 
Left: ri = 0.05; Center rj = 0.1; Right rj = 0.5. Top: the CV error and test error 
along the fitting path; Middle: the number of selected genes along the fitting path; 
Bottom: scatter plot for estimated label probabilities. The vertical dash line in each 
plot in the top two panels indicates where the tuning parameter is selected. 
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Table 1: The sign probability P(S' = Sq) under different signal-to-noise ratios. Each 
value is calculated by averaging over 100 simulation runs, and the corresponding 
standard error is given in the bracket. The term corr. in the second line of each 
panel is the squared correlation between the sign probability and the irrepresentable 
statistic. We use Kendall's r in the correlation calculation. 



Name 


SNR = 10 


SNR ==5 


SNR = 2 


SNR = 1 


BMIO 
corr. 


.398 (.030) 
.052 


.338 (.030) 
.047 


.077 (.018) 
.180 


.023 (.006) 
.110 


Lasso 
corr. 


.314 (.047) 
.418 


.203 (.030) 
.232 


.072 (.016) 
.194 


.022 (.006) 
.107 
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Table 2: The BAVA-MIO GLM estimation for the first model (covariance matrix 
with off-diagonal terms equal to 0). Each value is calculated by averaging over 
100 simulation runs, and the corresponding standard error is given in the bracket. 
BMIO-CV: the BAVA-MIO estimation using ten-fold cross validation; Lasso: the 
Lasso estimation. The top panel: n = 100; The bottom panel: n = 200. 





PMSE 


Z2-dis 


S-FPR 


\S\ 


BMIO-CV 
GLM-Lasso 


.186 (.004) 
.173 (.003) 


.039 (.002) 
.044 (.004) 


.305 (.031) 
.680 (.012) 


11.39 (1.788) 
21.34 (1.050) 


BMIO-CV 
GLM-Lasso 


.145 (.003) 
.144 (.002) 


.018 (.001) 
.026 (.001) 


.176 (.021) 
.694 (.011) 


7.54 (.549) 
27.65 (1.141) 



Table 3: The BAVA-MIO GLM estimation for the second model (covariance matrix 
with off-diagonal terms equal to 0.5). Each value is calculated by averaging over 
100 simulation runs, and the corresponding standard error is given in the bracket. 
BMIO-CV: the BAVA-MIO estimation using ten-fold cross validation; Lasso: the 
Lasso estimation. The top panel: n = 100; The bottom panel: n = 200. 





PMSE 


/2-dis 


S-FPR 


\S\ 


BMIO-CV 
GLM-Lasso 


.180 (.004) 
.169 (.003) 


.055 (.003) 
.052 (.003) 


.455 (.031) 
.654 (.018) 


13.86 (1.841) 
17.54 (.966) 


BMIO-CV 
GLM-Lasso 


.157 (.003) 
.154 (.003) 


.027 (.002) 
.030 (.001) 


.327 (.025) 
.654 (.012) 


8.58 (.564) 
20.67 (.908) 
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Table 4: The BAVA-MIO GLM estimation for the third model (covariance matrix 
with off-diagonal terms following a Toepolitz structure). Each value is calculated by 
averaging over 100 simulation runs, and the corresponding standard error is given in 
the bracket. BMIO-CV: the BAVA-MIO estimation using ten-fold cross validation; 
Lasso: the Lasso estimation. The top panel: n = 100; The bottom panel: n = 200. 





PMSE 


/2-dis 


S-FPR 


|5| 


BMIO-CV 
GLM-Lasso 


.180 (.004) 
.170 (.003) 


.046 (.003) 
.047 (.003) 


.271 (.028) 
.668 (.016) 


8.28 (1.187) 
19.18 (.914) 


BMIO-CV 
GLM-Lasso 


.152 (.004) 
.150 (.003) 


.022 (.001) 
.027 (.001) 


.203 (.023) 
.675 (.014) 


7.19 (.478) 
23.50 (1.045) 
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Table 5: Estimation results for the Diabetes data. For BMIO-BF* and BMIO-CV*, 
we report the median of the 100 subsampling estimates. The value in the bracket 
is the inclusion probability of the covariate based on the 100 subsampling samples. 
For g-prior, hyper-g and BIG, the value in the bracket is the posterior inclusion 
probability of the covariate. 



Name 


BMIO-BF 


BMIO-CV 


BMIO-BF* 


BMIO-CV* 


g-prior (BAS) 


hyper-g (BAS) 


BIC (BAS) 


age 


0.00 


0.00 


0.00 (0.00) 


0.00 (0.30) 


0.00 (0.11) 


0.00 (0.33) 


0.00 (0.05) 


sex 


0.00 


-11.08 


0.00 (0.00) 


-9.23 (0.64) 


-10.64 (0.99) 


-8.02 (0.97) 


-10.71 (0.98) 


bmi 


32.14 


25.06 


32.21 (1.00) 


25.94 (1.00) 


24.96 (1.00) 


19.00 (1.00) 


25.37 (1.00) 


map 


0.00 


15.01 


0.00 (0.10) 


14.85 (0.92) 


15.29 (1.00) 


11.55 (1.00) 


15.53 (1.00) 


tc 


0.00 


-6.97 


0.00 (0.00) 


0.00 (0.33) 


-16.62 (0.71) 


-13.54 (0.75) 


0.00 (0.57) 


Idl 


0.00 


0.00 


0.00 (0.00) 


0.00 (0.28) 


8.51 (0.50) 


6.51 (0.59) 


0.00 (0.38) 


hdl 


0.00 


-11.20 


0.00 (0.00) 


-8.89 (0.81) 


0.00 (0.49) 


0.00 (0.57) 


-7.29 (0.57) 


tch 


0.00 


0.00 


0.00 (0.00) 


0.00 (0.27) 


0.00 (0.30) 


0.00 (0.48) 


0.00 (0.20) 


Itg 


29.28 


25.72 


29.13 (0.99) 


24.10 (1.00) 


29.13 (1.00) 


22.29 (1.00) 


28.31 (1.00) 


glu 


0.00 


3.44 


0.00 (0.01) 


0.00 (0.47) 


0.00 (0.17) 


0.00 (0.41) 


0.00 (0.07) 
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Table 6: Classification results for Golub's gene expression data. 
Method CV-error Test-error jf^ of genes 



Golub et al. (1999) 




3/38 


4/34 


50 


Elastic Net (Zou&Hastie, 


2005) 


3/38 


0/34 


45 


^i-pen GLM (Park&Hasti 


e, 2007) 


1/38 


2/34 


23 


SIS-SCAD-LD (Fan&Lv, 2008) 


0/38 


1/34 


16 


FAIR (Fan&Fan, 2008) 




1/38 


1/34 


11 


BMIO-CV I 




1/38 


1/34 


8 


BMIO-CV II 




1/38 


1/34 


9 


BMIO-CV III 




1/38 


0/34 


23 
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